Empirical Evaluation of Mutation-based Test Prioritization Techniques
We propose a new test case prioritization technique that combines both
mutation-based and diversity-based approaches. Our diversity-aware
mutation-based technique relies on the notion of mutant distinguishment, which
aims to distinguish one mutant's behavior from another, rather than from the
original program. We empirically investigate the relative cost and
effectiveness of the mutation-based prioritization techniques (i.e., using both
the traditional mutant kill and the proposed mutant distinguishment) with 352
real faults and 553,477 developer-written test cases. The empirical evaluation
considers both the traditional and the diversity-aware mutation criteria in
various settings: single-objective greedy, hybrid, and multi-objective
optimization. The results show that there is no single dominant technique
across all the studied faults. To this end, we show when and why each of the
mutation-based prioritization criteria performs poorly, using a graphical
model called the Mutant Distinguishment Graph (MDG) that visualizes the
distribution of fault-detecting test cases with respect to mutant kills and
distinguishment.
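To make the two criteria concrete, the following sketch contrasts greedy prioritization by mutant kills with greedy prioritization by mutant distinguishment. It is a minimal illustration, not the authors' implementation: the kill matrix, the function names, and the simplified pairwise notion of distinguishment (a test separates two mutants when it kills one but not the other) are all assumptions made for the example.

```python
# Greedy test prioritization under two mutation-based criteria (toy sketch).
# kill_matrix[t][m] is True if test t kills mutant m (hypothetical format).

def greedy_prioritize(tests, score):
    """Repeatedly pick the test covering the most not-yet-covered items."""
    remaining, covered, order = set(tests), set(), []
    while remaining:
        best = max(remaining, key=lambda t: len(score(t) - covered))
        order.append(best)
        covered |= score(best)
        remaining.remove(best)
    return order

def killed_mutants(kill_matrix, t):
    # Traditional criterion: the set of mutants killed by test t.
    return {m for m, killed in enumerate(kill_matrix[t]) if killed}

def distinguished_pairs(kill_matrix, t):
    # Diversity-aware criterion (simplified): mutant pairs test t tells apart,
    # i.e., it kills one mutant of the pair but not the other.
    row = kill_matrix[t]
    return {(i, j) for i in range(len(row)) for j in range(i + 1, len(row))
            if row[i] != row[j]}

kill_matrix = [[True, True, False],   # test 0
               [True, False, False],  # test 1
               [False, True, True]]   # test 2
tests = range(len(kill_matrix))
print(greedy_prioritize(tests, lambda t: killed_mutants(kill_matrix, t)))
print(greedy_prioritize(tests, lambda t: distinguished_pairs(kill_matrix, t)))
```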
Effective Removal of Operational Log Messages: an Application to Model Inference
Model inference aims to extract accurate models from the execution logs of
software systems. However, in reality, logs may contain some "noise" that could
deteriorate the performance of model inference. One form of noise can commonly
be found in system logs that contain not only transactional messages---logging
the functional behavior of the system---but also operational
messages---recording the operational state of the system (e.g., a periodic
heartbeat to keep track of the memory usage). In low-quality logs,
transactional and operational messages are randomly interleaved, leading to the
erroneous inclusion of operational behaviors into a system model, which ideally
should only reflect the functional behavior of the system. It is therefore
important to remove operational messages in the logs before inferring models.
In this paper, we propose LogCleaner, a novel technique for removing
operational log messages. LogCleaner first performs a periodicity analysis to
filter out periodic messages, and then it performs a dependency analysis to
calculate the degree of dependency for all log messages and to remove
operational messages based on their dependencies. The experimental results on
two proprietary and 11 publicly available log datasets show that LogCleaner, on
average, can accurately remove 98% of the operational messages and preserve 81%
of the transactional messages. Furthermore, using logs pre-processed with
LogCleaner decreases the execution time of model inference (with a speed-up
ranging from 1.5 to 946.7 depending on the characteristics of the system) and
significantly improves the accuracy of the inferred models, by increasing their
ability to accept correct system behaviors (+43.8 pp on average, where pp
denotes percentage points) and to reject incorrect system behaviors (+15.0 pp
on average).
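As a rough illustration of the periodicity-analysis step, the sketch below flags message templates whose inter-arrival times are nearly constant, as a periodic heartbeat's would be. The entry format, the jitter threshold, and the function names are assumptions made for the example, not LogCleaner's actual interface, and the dependency-analysis step is omitted.

```python
# Toy periodicity analysis over (timestamp, template_id) log entries.
from collections import defaultdict
from statistics import mean, pstdev

def periodic_templates(entries, max_rel_jitter=0.1, min_occurrences=5):
    """Flag templates whose inter-arrival times are near-constant."""
    times = defaultdict(list)
    for ts, template in sorted(entries):
        times[template].append(ts)
    periodic = set()
    for template, ts_list in times.items():
        if len(ts_list) < min_occurrences:
            continue
        gaps = [b - a for a, b in zip(ts_list, ts_list[1:])]
        avg = mean(gaps)
        # Low relative jitter => the message recurs with a near-fixed period.
        if avg > 0 and pstdev(gaps) / avg < max_rel_jitter:
            periodic.add(template)
    return periodic

log = [(t, "heartbeat") for t in range(0, 100, 10)]   # operational
log += [(3, "login"), (17, "query"), (42, "logout")]  # transactional
print(periodic_templates(log))                        # {'heartbeat'}
```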
Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems
Deep Neural Networks (DNNs) have been widely used to perform real-world tasks
in cyber-physical systems such as Autonomous Driving Systems (ADS). Ensuring the
correct behavior of such DNN-Enabled Systems (DES) is a crucial topic. Online
testing is one of the promising modes for testing such systems with their
application environments (simulated or real) in a closed loop taking into
account the continuous interaction between the systems and their environments.
However, the environmental variables (e.g., lighting conditions) that might
change during the system's operation in the real world, potentially causing
the DES to violate safety or functional requirements, are often kept constant
during the execution of an online test scenario, due to two major challenges:
(1) the space of all possible scenarios to explore would become even larger if
they changed, and (2) there are typically many requirements to test
simultaneously.
In this paper, we present MORLOT (Many-Objective Reinforcement Learning for
Online Testing), a novel online testing approach to address these challenges by
combining Reinforcement Learning (RL) and many-objective search. MORLOT
leverages RL to incrementally generate sequences of environmental changes while
relying on many-objective search to determine the changes so that they are more
likely to achieve any of the uncovered objectives. We empirically evaluate
MORLOT using CARLA, a high-fidelity simulator widely used for autonomous
driving research, integrated with Transfuser, a DNN-enabled ADS for end-to-end
driving. The evaluation results show that MORLOT is significantly more
effective and efficient than alternatives with a large effect size. In other
words, MORLOT is a good option to test DES with dynamically changing
environments while accounting for multiple safety requirements.
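The toy sketch below conveys the spirit of that combination: an epsilon-greedy reinforcement-learning loop incrementally builds a sequence of environmental changes, while the many-objective component steers each step toward the uncovered objective currently closest to violation. The environment, actions, fitness function, and hyperparameters are illustrative stand-ins, not the paper's CARLA/Transfuser setup.

```python
# Toy RL + many-objective loop in the spirit of MORLOT (all values made up).
import random

ACTIONS = {"rain+": ("rain", +0.2), "rain-": ("rain", -0.2),
           "light+": ("light", +0.2), "light-": ("light", -0.2)}

def fitness(env, objective):
    # Distance to violating `objective`; <= 0 means violated. A stand-in for
    # simulator feedback (e.g., distance to a safety-requirement breach).
    return env[objective]

def morlot_like_search(objectives, episodes=30, steps=15, eps=0.2, alpha=0.5):
    q = {(o, a): 0.0 for o in objectives for a in ACTIONS}  # tabular Q-values
    uncovered, violations = set(objectives), {}
    for _ in range(episodes):
        env = {"rain": 1.0, "light": 1.0}                   # reset scenario
        for _ in range(steps):
            if not uncovered:
                return violations
            # Many-objective part: chase the objective closest to violation.
            target = min(uncovered, key=lambda o: fitness(env, o))
            # RL part: epsilon-greedy choice of the next environmental change.
            if random.random() < eps:
                action = random.choice(list(ACTIONS))
            else:
                action = max(ACTIONS, key=lambda a: q[(target, a)])
            var, delta = ACTIONS[action]
            before = fitness(env, target)
            env[var] += delta + random.uniform(-0.05, 0.05)  # noisy dynamics
            reward = before - fitness(env, target)           # fitness decrease
            q[(target, action)] += alpha * (reward - q[(target, action)])
            if fitness(env, target) <= 0:                    # objective covered
                uncovered.discard(target)
                violations[target] = dict(env)
    return violations

print(morlot_like_search(["rain", "light"]))
```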
Identifying the Hazard Boundary of ML-enabled Autonomous Systems Using Cooperative Co-Evolutionary Search
In Machine Learning (ML)-enabled autonomous systems (MLASs), it is essential
to identify the hazard boundary of ML Components (MLCs) in the MLAS under
analysis. Given that such a boundary captures the conditions, in terms of MLC
behavior and system context, that can lead to hazards, it can be used, for
example, to build a safety monitor that triggers predefined fallback
mechanisms at runtime when the hazard boundary is reached. However,
determining such a hazard boundary for an ML component is challenging. This is
because the problem space, which combines system contexts (i.e., scenarios)
and MLC behaviors (i.e., inputs and outputs), is far too large for exhaustive
exploration and even too large to handle using conventional metaheuristics,
such as genetic algorithms.
Additionally, the high computational cost of simulations required to determine
any MLAS safety violations makes the problem even more challenging.
Furthermore, it is unrealistic to consider a region in the problem space
deterministically safe or unsafe due to the uncontrollable parameters in
simulations and the non-linear behaviors of ML models (e.g., deep neural
networks) in the MLAS under analysis. To address the challenges, we propose
MLCSHE (ML Component Safety Hazard Envelope), a novel method based on a
Cooperative Co-Evolutionary Algorithm (CCEA), which aims to tackle a
high-dimensional problem by decomposing it into two lower-dimensional search
subproblems. Moreover, we take a probabilistic view of safe and unsafe regions
and define a novel fitness function to measure the distance from the
probabilistic hazard boundary and thus drive the search effectively. We
evaluate the effectiveness and efficiency of MLCSHE on a complex Autonomous
Vehicle (AV) case study. Our evaluation results show that MLCSHE is
significantly more effective and efficient compared to a standard genetic
algorithm and random search.
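The sketch below illustrates the cooperative co-evolutionary idea under heavy simplification: two populations (system contexts and MLC behaviors) evolve separately but are evaluated jointly, and the fitness rewards pairs lying on a probabilistic hazard boundary. The one-dimensional individuals and the closed-form hazard oracle are assumptions for the example; the real method relies on costly simulations.

```python
# Toy cooperative co-evolution toward a probabilistic hazard boundary.
import random

def hazard_probability(context, behavior):
    # Closed-form stand-in for estimating hazard probability via repeated,
    # expensive simulations of the MLAS.
    return min(1.0, max(0.0, 0.5 * context + 0.5 * behavior))

def pair_fitness(a, b, target=0.5):
    # Distance from the probabilistic hazard boundary (p ~= target); lower is
    # better, i.e., the (context, behavior) pair sits right on the boundary.
    return abs(hazard_probability(a, b) - target)

def evolve(pop, partners, mut_step=0.1):
    # Score each individual with its best collaborator from the other
    # population; the top half survives and is mutated to refill the rest.
    ranked = sorted(pop, key=lambda x: min(pair_fitness(x, p) for p in partners))
    elite = ranked[:len(pop) // 2]
    children = [min(1.0, max(0.0, e + random.uniform(-mut_step, mut_step)))
                for e in elite]
    return elite + children

random.seed(0)
contexts = [random.random() for _ in range(10)]   # e.g., scenario parameters
behaviors = [random.random() for _ in range(10)]  # e.g., MLC output errors
for _ in range(30):                               # alternate the two searches
    contexts = evolve(contexts, behaviors)
    behaviors = evolve(behaviors, contexts)
print(min((pair_fitness(c, b), c, b) for c in contexts for b in behaviors))
```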
Systematic Evaluation of Deep Learning Models for Failure Prediction
With the increasing complexity and scope of software systems, their
dependability is crucial. The analysis of log data recorded during system
execution can enable engineers to automatically predict failures at run time.
Several Machine Learning (ML) techniques, including traditional ML and Deep
Learning (DL), have been proposed to automate such tasks. However, current
empirical studies are limited in terms of covering all main DL types --
Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), and
Transformer -- as well as examining them on a wide range of diverse datasets.
In this paper, we aim to address these issues by systematically investigating
the combination of log data embedding strategies and DL types for failure
prediction. To that end, we propose a modular architecture to accommodate
various configurations of embedding strategies and DL-based encoders. To
further investigate how dataset characteristics such as dataset size and
failure percentage affect model accuracy, we synthesised 360 datasets, with
varying characteristics, for three distinct system behavioral models, based on
a systematic and automated generation approach. Using the F1 score metric, our
results show that the best overall performing configuration is a CNN-based
encoder with Logkey2vec. Additionally, we provide specific dataset conditions,
namely a dataset size >350 or a failure percentage >7.5%, under which this
configuration demonstrates high accuracy for failure prediction.
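As a sketch of what such a configuration looks like, the PyTorch model below embeds a sequence of log-key ids and runs it through a small convolutional encoder ending in a failure probability. All dimensions, names, and layer choices are illustrative assumptions, not the paper's architecture or its Logkey2vec embedding.

```python
# Toy CNN-over-embedded-log-keys failure predictor (illustrative dimensions).
import torch
import torch.nn as nn

class LogCNNPredictor(nn.Module):
    def __init__(self, vocab_size=200, embed_dim=32, num_filters=64, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # log-key embedding
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel, padding=1)
        self.head = nn.Linear(num_filters, 1)              # failure / normal

    def forward(self, log_keys):                   # (batch, seq_len) int ids
        x = self.embed(log_keys).transpose(1, 2)   # (batch, embed, seq_len)
        x = torch.relu(self.conv(x))               # local patterns over keys
        x = x.max(dim=2).values                    # global max-pool over time
        return torch.sigmoid(self.head(x)).squeeze(1)  # failure probability

model = LogCNNPredictor()
batch = torch.randint(0, 200, (4, 50))  # 4 sequences of 50 log-key ids
print(model(batch).shape)               # torch.Size([4])
```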
PRINS: Scalable Model Inference for Component-based System Logs
Behavioral software models play a key role in many software engineering tasks;
unfortunately, these models either are not available during software development
or, if available, quickly become outdated as implementations evolve.
Model inference techniques have been proposed as a viable solution to extract
finite state models from execution logs. However, existing techniques do not
scale well when processing very large logs that can be commonly found in
practice.
In this paper, we address the scalability problem of inferring the model of a
component-based system from large system logs, without requiring any extra
information. Our model inference technique, called PRINS, follows a divide-and-conquer
approach. The idea is to first infer a model of each system component
from the corresponding logs; then, the individual component models are merged
together taking into account the flow of events across components, as reflected in
the logs. We evaluated PRINS in terms of scalability and accuracy, using nine
datasets composed of logs extracted from publicly available benchmarks and a
personal computer running desktop business applications. The results show that
PRINS can process large logs much faster than a publicly available and well-known
state-of-the-art tool, without significantly compromising the accuracy of
inferred models.
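The sketch below conveys the divide-and-conquer idea at a toy scale: each component's model is inferred from its own projection of the log, and the cross-component event flow is then recovered from the interleaved system log. The plain transition sets stand in for the finite state machines a real inference tool would produce; the input format is an assumption for the example.

```python
# Toy PRINS-style divide-and-conquer over a component-based system log.
from collections import defaultdict

def infer_component_model(events):
    # Stand-in for model inference: record observed event-to-event transitions.
    return {(a, b) for a, b in zip(events, events[1:])}

def prins_like(system_log):
    # system_log: ordered (component, event) pairs from one execution.
    per_component = defaultdict(list)
    for comp, event in system_log:
        per_component[comp].append(event)
    # Divide: infer each component model from its own projection of the log.
    models = {c: infer_component_model(ev) for c, ev in per_component.items()}
    # Conquer: stitch models along the cross-component flow in the full log.
    stitching = {(a, b) for a, b in zip(system_log, system_log[1:])
                 if a[0] != b[0]}
    return models, stitching

log = [("db", "open"), ("app", "query"), ("db", "fetch"),
       ("app", "render"), ("db", "close")]
models, stitching = prins_like(log)
print(models)      # per-component transition sets
print(stitching)   # cross-component hand-offs
```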
Digital Twins Are Not Monozygotic -- Cross-Replicating ADAS Testing in Two Industry-Grade Automotive Simulators
The increasing levels of software- and data-intensive driving automation call
for an evolution of automotive software testing. As a recommended practice of
the Verification and Validation (V&V) process of ISO/PAS 21448, a candidate
standard for safety of the intended functionality for road vehicles,
simulation-based testing has the potential to reduce both risks and costs.
There is a growing body of research on devising test automation techniques
using simulators for Advanced Driver-Assistance Systems (ADAS). However, how
similar are the results if the same test scenarios are executed in different
simulators? We conduct a replication study of applying a Search-Based Software
Testing (SBST) solution to a real-world ADAS (PeVi, a pedestrian vision
detection system) using two different commercial simulators, namely,
TASS/Siemens PreScan and ESI Pro-SiVIC. Based on a minimalistic scene, we
compare critical test scenarios generated using our SBST solution in these two
simulators. We show that SBST can be used to effectively and efficiently
generate critical test scenarios in both simulators, and the test results
obtained from the two simulators can reveal several weaknesses of the ADAS
under test. However, executing the same test scenarios in the two simulators
leads to notable differences in the details of the test outputs, in particular,
related to (1) safety violations revealed by tests, and (2) dynamics of cars
and pedestrians. Based on our findings, we recommend future V&V plans to
include multiple simulators to support robust simulation-based testing and to
base test objectives on measures that are less dependent on the internals of
the simulators.
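A cross-simulator comparison of the kind reported here can be reduced, at its simplest, to diffing per-scenario metrics from the two runs, as in the sketch below. The metric names, values, and tolerance are hypothetical; obtaining them would require actually executing the scenario in PreScan and Pro-SiVIC.

```python
# Toy diff of per-scenario metrics from two simulator runs (values made up).
def divergent_metrics(run_a, run_b, tol=0.5):
    # Flag metrics whose values differ beyond a tolerance across simulators.
    return {m: (run_a[m], run_b[m]) for m in run_a
            if m in run_b and abs(run_a[m] - run_b[m]) > tol}

prescan  = {"min_dist_to_pedestrian": 1.2, "time_to_collision": 0.8}
prosivic = {"min_dist_to_pedestrian": 2.4, "time_to_collision": 0.9}
print(divergent_metrics(prescan, prosivic))
# {'min_dist_to_pedestrian': (1.2, 2.4)} -- a safety-relevant divergence
```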
Quality Based Software Project Staffing and Scheduling with Budget and Deadline
Software project planning becomes more complicated and more important as software projects grow in size. Many approaches have been proposed to help project managers by providing optimal staffing and scheduling in terms of minimizing salary cost or duration. Unfortunately, software quality, another critical factor in software project planning, is largely overlooked in previous work. In this paper, we propose a quality-based software project staffing and scheduling approach. We provide better software project plans that consider quality under either a cost bound (Budget) or a duration bound (Deadline) for software project managers.
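At its core, the planning problem described here is constrained optimization: maximize quality subject to a budget or deadline bound. The sketch below makes that concrete over a handful of made-up candidate plans; a real approach would search an enormous plan space with metaheuristics rather than enumerate.

```python
# Toy quality-based plan selection under a budget or deadline bound.
plans = [
    {"staff": 3, "cost": 90,  "duration": 12, "quality": 0.70},
    {"staff": 5, "cost": 140, "duration": 8,  "quality": 0.85},
    {"staff": 8, "cost": 220, "duration": 6,  "quality": 0.92},
]

def best_plan(plans, budget=None, deadline=None):
    # Keep only plans satisfying the bound, then maximize expected quality.
    feasible = [p for p in plans
                if (budget is None or p["cost"] <= budget)
                and (deadline is None or p["duration"] <= deadline)]
    return max(feasible, key=lambda p: p["quality"], default=None)

print(best_plan(plans, budget=150))   # best quality within the Budget bound
print(best_plan(plans, deadline=7))   # best quality within the Deadline bound
```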